Bi-calibration Networks for Weakly-Supervised Video Representation Learning
Authors
Abstract
The leverage of large volumes of web videos paired with the query (short phrase for searching the video) or surrounding text (long textual description, e.g., video title) offers an economic and extensible alternative to supervised video representation learning. Nevertheless, modeling such weakly visual-textual connection is not trivial due to query polysemy (i.e., many possible meanings for a query) and text isomorphism (i.e., same syntactic structure of different text). In this paper, we introduce a new design of mutual calibration between query and text to achieve more reliable visual-textual supervision for video representation learning. Specifically, we present Bi-Calibration Networks (BCN) that novelly couples two calibrations to learn the correction from text to query and vice versa. Technically, BCN executes clustering on all the titles of the videos searched by an identical query and takes the centroid of each cluster as a text prototype. All the queries constitute the query set. The representation learning is then formulated as video classification over text prototypes and queries, with text-to-query and query-to-text calibrations. A selection scheme is also devised to balance the two calibrations. Two large-scale web video datasets paired with query and title for each video, named YOVO-3M and YOVO-10M, are newly collected for weakly-supervised video feature learning. The video features of BCN with ResNet backbone learnt on YOVO-3M (3M YouTube videos) obtain superior results under linear protocol on action recognition. More remarkably, BCN trained on the larger set of YOVO-10M (10M YouTube videos) with further fine-tuning leads to 1.3% gain in top-1 accuracy on Kinetics-400 dataset over the state-of-the-art TAda2D method with ImageNet pre-training. Source code is available at https://github.com/FuchenUSTC/BCN .
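The prototype-construction step described in the abstract (cluster the titles retrieved by one query, take each centroid as a text prototype) can be illustrated with a minimal sketch. It assumes that title embeddings already exist as fixed-length vectors. The plain k-means routine, the names `kmeans` and `build_text_prototypes`, the `k_per_query` parameter, and the random stand-in embeddings are all illustrative assumptions, not the authors' implementation, which lives in the linked repository.

```python
import numpy as np


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (illustrative); returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    k = min(k, len(points))
    # Initialise centroids from k distinct input points.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = points[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    # Final assignment against the final centroids.
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, dists.argmin(axis=1)


def build_text_prototypes(title_emb, query_ids, k_per_query=2):
    """Cluster titles separately per query; centroids act as text prototypes.

    Returns (prototypes, proto_label, proto_query): proto_label[i] is the
    global prototype index of video i, proto_query[p] is the query owning p.
    """
    prototypes, proto_query = [], []
    proto_label = np.empty(len(title_emb), dtype=int)
    for q in np.unique(query_ids):
        idx = np.flatnonzero(query_ids == q)
        centroids, assign = kmeans(title_emb[idx], k_per_query)
        # Offset local cluster ids by the number of prototypes built so far.
        proto_label[idx] = assign + len(prototypes)
        prototypes.extend(centroids)
        proto_query.extend([q] * len(centroids))
    return np.stack(prototypes), proto_label, np.array(proto_query)


# Stand-in data: 12 videos, 8-dim title embeddings, two queries (assumed values).
rng = np.random.default_rng(0)
title_emb = rng.normal(size=(12, 8))
query_ids = np.array([0] * 6 + [1] * 6)

prototypes, proto_label, proto_query = build_text_prototypes(title_emb, query_ids)
```

Each video then carries two weak labels, its query id and its text-prototype id, which is the setting in which the abstract formulates classification over both label sets. The calibration terms and the selection scheme are not sketched here because the abstract does not specify them.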
Similar resources
Weakly Supervised Learning of Object Segmentations from Web-Scale Video
We propose to learn pixel-level segmentations of objects from weakly labeled (tagged) internet videos. Specifically, given a large collection of raw YouTube content, along with potentially noisy tags, our goal is to automatically generate spatiotemporal masks for each object, such as “dog”, without employing any pre-trained object detectors. We formulate this problem as learning weakly supervis...
Weakly-supervised Dictionary Learning
We present a probabilistic modeling and inference framework for discriminative analysis dictionary learning under a weak supervision setting. Dictionary learning approaches have been widely used for tasks such as low-level signal denoising and restoration as well as high-level classification tasks, which can be applied to audio and image analysis. Synthesis dictionary learning aims at jointly l...
Weakly Supervised Memory Networks
We introduce a neural network with a recurrent attention model over a possibly large external memory. The architecture is a form of Memory Network [23] but unlike the model in that work, it is trained end-to-end, and hence requires significantly less supervision during training, making it more generally applicable in realistic settings. It can also be seen as an extension of RNNsearch [2] to th...
Bandit Label Inference for Weakly Supervised Learning
The scarcity of data annotated at the desired level of granularity is a recurring issue in many applications. Significant amounts of effort have been devoted to developing weakly supervised methods tailored to each individual setting, which are often carefully designed to take advantage of the particular properties of weak supervision regimes, form of available data and prior knowledge of the t...
Discrete-Continuous Splitting for Weakly Supervised Learning
This paper introduces a novel algorithm for transductive inference in higher-order MRFs, where the unary energies are parameterized by a variable classifier. The considered task is posed as a joint optimization problem in the continuous classifier parameters and the discrete label variables. In contrast to prior approaches such as convex relaxations, we propose an advantageous decoupling of the...
Journal
Journal title: International Journal of Computer Vision
Year: 2023
ISSN: 0920-5691, 1573-1405
DOI: https://doi.org/10.1007/s11263-023-01779-w